from PIL import Image
img = Image.open("netflix.png")
img
Getting the Data
import pandas as pd
import numpy as np
netflix = pd.read_csv('netflix_titles.csv')
pd.set_option('display.max_columns',None) # display all the features
netflix.head(5)
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
Data Information
print(netflix.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None
netflix.describe()
| | release_year |
|---|---|
| count | 8807.000000 |
| mean | 2014.180198 |
| std | 8.819312 |
| min | 1925.000000 |
| 25% | 2013.000000 |
| 50% | 2017.000000 |
| 75% | 2019.000000 |
| max | 2021.000000 |
print(netflix.duplicated().value_counts())
netflix.drop_duplicates(inplace = True)
print(len(netflix))
False    8807
dtype: int64
8807
print('Data columns with null values:\n',
netflix.isnull().sum())
Data columns with null values:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
netflix.nunique()
show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64
print('type:\n',list(netflix['type'].unique()))
print('*'*70)
print('\ntitle:\n',list(netflix['title'].unique()))
print('*'*70)
print('\ndirector:\n',list(netflix['director'].unique()))
print('*'*70)
print('\ncast:\n',list(netflix['cast'].unique()))
print('*'*70)
print('\ncountry:\n',list(netflix['country'].unique()))
print('*'*70)
print('\ndate_added:\n',list(netflix['date_added'].unique()))
print('*'*70)
print('\nrelease_year:\n',list(netflix['release_year'].unique()))
print('*'*70)
print('\nrating:\n',list(netflix['rating'].unique()))
print('*'*70)
print('\nduration:\n',list(netflix['duration'].unique()))
print('*'*70)
print('\nlisted_in:\n',list(netflix['listed_in'].unique()))
print('*'*70)
print('\ndescription:\n',list(netflix['description'].unique()))
print('*'*70)
The code above prints every unique value in each column. Because the output is very large, we are not executing it here.
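A more compact way to inspect the unique values is to print only the count and a few samples per column. The sketch below assumes a small toy frame in place of the Netflix data; the helper name `preview_unique` and its `max_values` parameter are illustrative and not part of the original notebook.

```python
import pandas as pd

def preview_unique(df, max_values=5):
    """Return, per column, the unique count and the first few unique values."""
    rows = []
    for col in df.columns:
        uniques = df[col].dropna().unique()
        rows.append({
            "column": col,
            "n_unique": len(uniques),
            "sample": list(uniques[:max_values]),
        })
    return pd.DataFrame(rows)

# Toy frame standing in for the Netflix data (hypothetical values)
demo = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "rating": ["PG-13", "TV-MA", None, "TV-MA"],
})
summary = preview_unique(demo, max_values=3)
print(summary)
```

On the real data, `preview_unique(netflix)` would give one row per column instead of thousands of printed values.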
netflix=netflix.drop(['show_id'],axis=1)
As we have already seen, some features have very few null values, so we can remove those rows directly; doing so will hardly affect our overall analysis. The features whose null rows we are removing are date_added, rating, and duration.
In three other columns a large share of the values is missing. Dropping those rows from the dataset would negatively affect our further analysis, so instead of removing them we replace the null values with the keyword "unknown". The features whose null values we are replacing are director, cast, and country.
netflix.dropna(subset=['date_added'],how='any',inplace=True) # dropping null value rows of "date_added" column
netflix.dropna(subset=['rating'],how='any',inplace=True) # dropping null value rows of "rating" column
netflix.dropna(subset=['duration'],how='any',inplace=True) # dropping null value rows of "duration" column
# assign the result back instead of calling replace(..., inplace=True) on a column selection
netflix['director'] = netflix['director'].fillna('unknown') # replacing NaN value with "unknown"
netflix['cast'] = netflix['cast'].fillna('unknown') # replacing NaN value with "unknown"
netflix['country'] = netflix['country'].fillna('unknown') # replacing NaN value with "unknown"
print('Data columns with null values:\n',
netflix.isnull().sum())
Data columns with null values:
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
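The choice between dropping rows and filling with a placeholder follows from how much of each column is missing. The sketch below is plain arithmetic on the null counts printed earlier; the 1% cutoff (`DROP_THRESHOLD`) is a hypothetical value chosen for illustration, not a rule stated in the notebook.

```python
total_rows = 8807  # from netflix.info()
null_counts = {"director": 2634, "cast": 825, "country": 831,
               "date_added": 10, "rating": 4, "duration": 3}

# Hypothetical cutoff for illustration: drop rows when under 1% of a column is missing
DROP_THRESHOLD = 0.01
plan = {}
for column, nulls in null_counts.items():
    fraction = nulls / total_rows
    plan[column] = "drop rows" if fraction < DROP_THRESHOLD else "fill with 'unknown'"
    print(f"{column:<11} {fraction:6.2%}  {plan[column]}")
```

With these numbers, director is missing in roughly 30% of rows and cast and country in roughly 9% each, while date_added, rating, and duration are each missing in well under 1%.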
netflix_white_spacefree = netflix.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
netflix_lower = netflix_white_spacefree.apply(lambda x: x.astype(str).str.lower())
netflix_lower.head(5)
| | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | movie | dick johnson is dead | kirsten johnson | unknown | united states | september 25, 2021 | 2020 | pg-13 | 90 min | documentaries | as her father nears the end of his life, filmm... |
| 1 | tv show | blood & water | unknown | ama qamata, khosi ngema, gail mabalane, thaban... | south africa | september 24, 2021 | 2021 | tv-ma | 2 seasons | international tv shows, tv dramas, tv mysteries | after crossing paths at a party, a cape town t... |
| 2 | tv show | ganglands | julien leclercq | sami bouajila, tracy gotoas, samuel jouy, nabi... | unknown | september 24, 2021 | 2021 | tv-ma | 1 season | crime tv shows, international tv shows, tv act... | to protect his family from a powerful drug lor... |
| 3 | tv show | jailbirds new orleans | unknown | unknown | unknown | september 24, 2021 | 2021 | tv-ma | 1 season | docuseries, reality tv | feuds, flirtations and toilet talk go down amo... |
| 4 | tv show | kota factory | unknown | mayur more, jitendra kumar, ranjan raj, alam k... | india | september 24, 2021 | 2021 | tv-ma | 2 seasons | international tv shows, romantic tv shows, tv ... | in a city of coaching centers known to train i... |
netflix_lower.drop_duplicates(subset ="title",keep = False, inplace = True)
netflix_purified = netflix_lower.copy()
netflix_purified['duration'] = netflix_purified['duration'].str.replace(' min','', regex=False).str.strip()
# convert "<n> season(s)" to a number: 180 per season, except 1 season -> 130 and 4 seasons -> 420
# (both exceptions are kept from the original mapping, although they break the 180-per-season pattern)
def season_to_number(match):
    n = int(match.group(1))
    return str({1: 130, 4: 420}.get(n, 180 * n))
# the pattern is anchored to the whole string, so '1 season' can no longer match inside '11 seasons'
netflix_purified['duration'] = netflix_purified['duration'].str.replace(r'^(\d+) seasons?$', season_to_number, regex=True).str.strip()
netflix_purified.head(5)
| | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | movie | dick johnson is dead | kirsten johnson | unknown | united states | september 25, 2021 | 2020 | pg-13 | 90 | documentaries | as her father nears the end of his life, filmm... |
| 1 | tv show | blood & water | unknown | ama qamata, khosi ngema, gail mabalane, thaban... | south africa | september 24, 2021 | 2021 | tv-ma | 360 | international tv shows, tv dramas, tv mysteries | after crossing paths at a party, a cape town t... |
| 2 | tv show | ganglands | julien leclercq | sami bouajila, tracy gotoas, samuel jouy, nabi... | unknown | september 24, 2021 | 2021 | tv-ma | 130 | crime tv shows, international tv shows, tv act... | to protect his family from a powerful drug lor... |
| 3 | tv show | jailbirds new orleans | unknown | unknown | unknown | september 24, 2021 | 2021 | tv-ma | 130 | docuseries, reality tv | feuds, flirtations and toilet talk go down amo... |
| 4 | tv show | kota factory | unknown | mayur more, jitendra kumar, ranjan raj, alam k... | india | september 24, 2021 | 2021 | tv-ma | 360 | international tv shows, romantic tv shows, tv ... | in a city of coaching centers known to train i... |
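An alternative to converting seasons into a minutes-like number is to keep the count and its unit in separate columns, so that movies and TV shows are never compared on the same scale by accident. The sketch below assumes every duration has the form `<number> min` or `<number> season(s)`; the column names `value` and `unit` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Toy values in the same format as the lower-cased duration column
durations = pd.Series(["90 min", "2 seasons", "1 season", "11 seasons"])

# Capture the leading number and the unit word; the optional 's' handles plurals
parts = durations.str.extract(r"^(?P<value>\d+)\s+(?P<unit>min|season)s?$")
parts["value"] = parts["value"].astype(int)
print(parts)
```

Rows that do not match the pattern would come back as NaN, which makes unexpected formats easy to spot before the integer conversion.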
netflix_purified['release_year']=netflix_purified['release_year'].astype(int)
print(netflix_purified.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8778 entries, 0 to 8806
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   type          8778 non-null   object
 1   title         8778 non-null   object
 2   director      8778 non-null   object
 3   cast          8778 non-null   object
 4   country       8778 non-null   object
 5   date_added    8778 non-null   object
 6   release_year  8778 non-null   int32
 7   rating        8778 non-null   object
 8   duration      8778 non-null   object
 9   listed_in     8778 non-null   object
 10  description   8778 non-null   object
dtypes: int32(1), object(10)
memory usage: 788.6+ KB
None
netflix_purified.to_csv('netfilx_cleaned.csv')
A slight modification to the data: restoring the NaN values
# assign the result back instead of calling replace(..., inplace=True) on a column selection
netflix_purified['director'] = netflix_purified['director'].replace('unknown',np.nan) # keeping the NaN values, replace later
netflix_purified['cast'] = netflix_purified['cast'].replace('unknown',np.nan) # keeping the NaN values, replace later
netflix_purified['country'] = netflix_purified['country'].replace('unknown',np.nan) # keeping the NaN values, replace later
print('Data columns with null values:\n',
netflix_purified.isnull().sum())
Data columns with null values:
type               0
title              0
director        2617
cast             825
country          829
date_added         0
release_year       0
rating             0
duration           0
listed_in          0
description        0
dtype: int64
import numpy as np
import pandas as pd
import plotly.express as px # for data visualization
To begin the task of analyzing Netflix data, I’ll start by looking at the distribution of content ratings on Netflix:
content = netflix_purified.groupby(['rating']).size().reset_index(name='counts')
pieChart = px.pie(content, values='counts', names='rating',
title='Distribution of Content Ratings on Netflix')
pieChart.show()
print(content)
      rating  counts
0          g      41
1      nc-17       3
2         nr      78
3         pg     287
4      pg-13     490
5          r     799
6      tv-14    2155
7       tv-g     220
8      tv-ma    3197
9      tv-pg     860
10      tv-y     306
11     tv-y7     333
12  tv-y7-fv       6
13        ur       3
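The graph above shows that the largest share of content on Netflix is rated “TV-MA” (3,197 of 8,778 titles, about 36%), followed by “TV-14” (2,155 titles). The single most common rating is therefore one intended for mature and adult audiences, although it does not account for a majority of the catalog.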
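The shares behind the pie chart can be recomputed from the printed counts. The sketch below is plain arithmetic on those numbers; the variable names are illustrative.

```python
# Counts copied from the printed `content` table above
rating_counts = {
    "g": 41, "nc-17": 3, "nr": 78, "pg": 287, "pg-13": 490, "r": 799,
    "tv-14": 2155, "tv-g": 220, "tv-ma": 3197, "tv-pg": 860,
    "tv-y": 306, "tv-y7": 333, "tv-y7-fv": 6, "ur": 3,
}
total = sum(rating_counts.values())
shares = {rating: count / total for rating, count in rating_counts.items()}
top_rating = max(shares, key=shares.get)
print(total, top_rating, round(100 * shares[top_rating], 1))
# → 8778 tv-ma 36.4
```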
Now let’s see the 5 countries with the most titles on Netflix:
filtered_c=pd.DataFrame()
filtered_c=netflix_purified['country'].str.split(',',expand=True).stack().str.strip() # strip spaces so ' india' and 'india' are counted together
filtered_c=filtered_c.to_frame()
filtered_c.columns=['Country']
c=filtered_c.groupby(['Country']).size().reset_index(name='Total Content')
#c=c[c.Country !='No Country Specified']
c=c.sort_values(by=['Total Content'],ascending=False)
cTop5=c.head()
cTop5=cTop5.sort_values(by=['Total Content'])
fig5=px.bar(cTop5,x='Total Content',y='Country',title='Top 5 Countries on Netflix')
fig5.show()
The graph above shows the five countries with the most titles on this platform.
Now let’s see the top 5 directors by number of titles on this platform:
netflix_purified['director']=netflix_purified['director'].fillna('No Director Specified')
filtered_directors=pd.DataFrame()
filtered_directors=netflix_purified['director'].str.split(',',expand=True).stack().str.strip() # strip spaces left by the split so each name is counted once
filtered_directors=filtered_directors.to_frame()
filtered_directors.columns=['Director']
directors=filtered_directors.groupby(['Director']).size().reset_index(name='Total Content')
directors=directors[directors.Director !='No Director Specified']
directors=directors.sort_values(by=['Total Content'],ascending=False)
directorsTop5=directors.head()
directorsTop5=directorsTop5.sort_values(by=['Total Content'])
fig1=px.bar(directorsTop5,x='Total Content',y='Director',title='Top 5 Directors on Netflix')
fig1.show()
The graph above shows the five directors with the most titles on this platform.
Now let's see the top 5 actors by number of titles on this platform:
netflix_purified['cast']=netflix_purified['cast'].fillna('No Cast Specified')
filtered_cast=pd.DataFrame()
filtered_cast=netflix_purified['cast'].str.split(',',expand=True).stack().str.strip() # strip spaces left by the split so each name is counted once
filtered_cast=filtered_cast.to_frame()
filtered_cast.columns=['Actor']
actors=filtered_cast.groupby(['Actor']).size().reset_index(name='Total Content')
actors=actors[actors.Actor !='No Cast Specified']
actors=actors.sort_values(by=['Total Content'],ascending=False)
actorsTop5=actors.head()
actorsTop5=actorsTop5.sort_values(by=['Total Content'])
fig2=px.bar(actorsTop5,x='Total Content',y='Actor',title='Top 5 Actors on Netflix')
fig2.show()
The graph above shows the five actors with the most titles on this platform.
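The same split, count, and rank steps are repeated for countries, directors, and actors, so they can be sketched as a single helper. This is a minimal version run on a toy frame; the function name `top_n_from_list_column` and its parameters are hypothetical, and it uses `explode` rather than the `split(expand=True).stack()` pattern of the cells above.

```python
import pandas as pd

def top_n_from_list_column(df, column, n=5, sep=","):
    """Count the entries of a separator-delimited text column and return the n most frequent."""
    values = (
        df[column]
        .dropna()          # rows without a value contribute nothing
        .str.split(sep)    # "a, b" -> ["a", " b"]
        .explode()         # one row per entry
        .str.strip()       # remove the spaces left by the split
    )
    values = values[values != ""]
    counts = values.value_counts().head(n)
    return counts.rename_axis(column).reset_index(name="Total Content")

# Toy frame standing in for netflix_purified (hypothetical values)
demo = pd.DataFrame({"country": ["india, united states", "india", None, "united states, india"]})
top = top_n_from_list_column(demo, "country", n=2)
print(top)
```

The resulting frame has the same two-column shape that the `px.bar` calls above expect.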
The next thing to analyze from this data is the trend of production over the years on Netflix:
df1=netflix_purified[['type','release_year']]
df1=df1.rename(columns={"release_year": "Release Year"})
df2=df1.groupby(['Release Year','type']).size().reset_index(name='Total Content')
df2=df2[df2['Release Year']>=2010]
print(df2)
fig3=px.bar(df2,x='Release Year',y='Total Content',title='Content released per year on Netflix') # title fixed: this chart is not about actors
fig3.show()
     Release Year     type  Total Content
95           2010    movie            151
96           2010  tv show             39
97           2011    movie            145
98           2011  tv show             40
99           2012    movie            173
100          2012  tv show             63
101          2013    movie            225
102          2013  tv show             61
103          2014    movie            262
104          2014  tv show             88
105          2015    movie            396
106          2015  tv show            159
107          2016    movie            658
108          2016  tv show            243
109          2017    movie            763
110          2017  tv show            265
111          2018    movie            767
112          2018  tv show            377
113          2019    movie            633
114          2019  tv show            397
115          2020    movie            517
116          2020  tv show            436
117          2021    movie            277
118          2021  tv show            315
The bar chart above shows that the number of titles per release year starts growing after 2011, peaks in 2018, and falls in every year after that, which suggests that Netflix has slowed down its addition of newer content.
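The peak year can be confirmed by adding the movie and TV show counts for each release year. The sketch below is plain arithmetic on the printed df2 table; the variable names are illustrative.

```python
# (movies, tv shows) per release year, copied from the printed df2 table
per_year = {
    2010: (151, 39), 2011: (145, 40), 2012: (173, 63), 2013: (225, 61),
    2014: (262, 88), 2015: (396, 159), 2016: (658, 243), 2017: (763, 265),
    2018: (767, 377), 2019: (633, 397), 2020: (517, 436), 2021: (277, 315),
}
totals = {year: movies + shows for year, (movies, shows) in per_year.items()}
peak_year = max(totals, key=totals.get)
print(peak_year, totals[peak_year])
# → 2018 1144
```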
df1=netflix_purified[['type','release_year']]
df1=df1.rename(columns={"release_year": "Release Year"})
df2=df1.groupby(['Release Year','type']).size().reset_index(name='Total Content')
df2=df2[df2['Release Year']>=2010]
fig4 = px.line(df2, x="Release Year", y="Total Content", color='type',title='Trend of content produced over the years on Netflix')
fig4.show()
The line graph above shows that the number of movies by release year has declined since 2018, while the number of TV shows keeps increasing gradually until 2020 and then drops sharply. This suggests that Netflix has shifted more of its focus to TV shows.
Finally, to conclude our analysis, I will analyze the sentiment of content on Netflix:
! pip install textblob # This command may not work properly on iOS
from textblob import TextBlob # for sentiment analysis, This command may not work properly on iOS
# This set of code may not work properly on iOS
dfx=netflix_purified[['release_year','description']].copy() # work on a copy so the assignment below does not target a slice
dfx=dfx.rename(columns={'release_year':'Release Year'})
for index,row in dfx.iterrows():
    z=row['description']
    testimonial=TextBlob(z)
    p=testimonial.sentiment.polarity # polarity ranges from -1 (negative) to 1 (positive)
    if p==0:
        sent='Neutral'
    elif p>0:
        sent='Positive'
    else:
        sent='Negative'
    dfx.loc[[index],'Sentiment']=sent
dfx=dfx.groupby(['Release Year','Sentiment']).size().reset_index(name='Total Content')
dfx=dfx[dfx['Release Year']>=2010]
fig4 = px.bar(dfx, x="Release Year", y="Total Content", color="Sentiment", title="Sentiment of content on Netflix")
fig4.show()
The graph above shows that, in every year, titles with a positive description outnumber those with neutral and negative descriptions combined.
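One way to check this reading numerically is to pivot the grouped counts and compare the positive count with the sum of the other two for each year. The sketch below uses a toy frame with hypothetical numbers shaped like the grouped `dfx`; the column names `positive_share` and `positive_majority` are illustrative.

```python
import pandas as pd

# Toy stand-in for the grouped dfx frame (hypothetical numbers)
demo = pd.DataFrame({
    "Release Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "Sentiment": ["Negative", "Neutral", "Positive"] * 2,
    "Total Content": [30, 10, 60, 40, 20, 50],
})

# One row per year, one column per sentiment label
wide = demo.pivot(index="Release Year", columns="Sentiment", values="Total Content").fillna(0)
wide["positive_share"] = wide["Positive"] / wide[["Negative", "Neutral", "Positive"]].sum(axis=1)
wide["positive_majority"] = wide["Positive"] > wide["Negative"] + wide["Neutral"]
print(wide)
```

Applied to the real `dfx`, the claim holds only if `positive_majority` is True for every year.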